Best Practices for Crowdsourcing Dialectal Arabic Speech Transcription

نویسندگان

  • Samantha Wray
  • Hamdy Mubarak
  • Ahmed M. Ali
چکیده

In this paper, we investigate different approaches in crowdsourcing transcriptions of Dialectal Arabic speech with automatic quality control to ensure good transcription at the source. Since Dialectal Arabic has no standard orthographic representation, it is very challenging to perform quality control. We propose a complete recipe for speech transcription quality control that includes using output of an Automatic Speech Recognition system. We evaluated the quality of the transcribed speech and through this recipe, we achieved a reduction in transcription error of 1.0% compared with 13.2% baseline with no quality control for Egyptian data, and down to 4% compared with 7.8% for the North African dialect.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Crowdsource a little to label a lot: labeling a speech corpus of dialectal Arabic

Arabic is a language with great dialectal variety, with Modern Standard Arabic (MSA) being the only standardized dialect. Spoken Arabic is characterized by frequent code-switching between MSA and Dialectal Arabic (DA). DA varieties are typically differentiated by region, but despite their wide-spread usage, they are under-resourced and lack viable corpora and tools necessary for speech recognit...

متن کامل

Dialectal Arabic Telephone Speech Corpus: Principles, Tool design, and Transcription Conventions

The present paper presents the experience gained at LDC in the collection and transcription of a corpus of conversational telephone speech in dialectal Arabic. The paper will cover the following: (a) Arabic language background; (b) objectives, principles, and methodological choices of dialectal Arabic transcription, (c) conceptualization and design features of LDC’s ‘Arabic Multi-Dialectal Tran...

متن کامل

Dialectal Arabic Orthography-based Transcription

The present paper describes the experience gained at LDC in the collection and transcription of conversational dialectal Arabic. The paper will cover the following: (a) Arabic language background; (b) objectives. principles, and methodological choices of dialectal Arabic transcription, (c) design features of LDC‟s „Arabic MultiDialectal Transcription Tool‟ (AMADAT) and metalanguage transcriptio...

متن کامل

Automatic Diacritization Of Arabic For Acoustic Modeling In Speech Recognition

Automatic recognition of Arabic dialectal speech is a challenging task because Arabic dialects are essentially spoken varieties. Only few dialectal resources are available to date; moreover, most available acoustic data collections are transcribed without diacritics. Such a transcription omits essential pronunciation information about a word, such as short vowels. In this paper we investigate v...

متن کامل

Cross-Dialectal Data Transferring for Gaussian Mixture Model Training in Arabic Speech Recognition

Dialectal Arabic speech recognition is a difficult problem and is relatively less studied. In this paper, we propose a cross-dialectal Gaussian mixture model training criteria to transfer knowledge from one domain to the other by data sharing. Specifically, phone classification experiments on West Point Modern Standard Arabic Speech corpus and Babylon Levantine Arabic Speech corpus demonstrate ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015